Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style - it is 100% hands on! A few hours prior to each lecture, links to the materials will be available for download on Quercus. The teaching materials will consist of a JupyterLab notebook with concepts, comments, instructions, and blank spaces that you will fill in with R code as you follow along with the instructor. Other teaching materials include an HTML version of the notebook and, when required, datasets to import into R. This learning approach lets you spend your time coding rather than taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.
We'll take a blank-slate approach to R and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from one of these potential scenarios:
A pile of data (like an Excel file or tab-separated file) full of experimental observations that you don't know what to do with.
Maybe you're manipulating large tables entirely in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis all over again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about R and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
In the first two lessons, we will talk about the basic data structures and objects in R, get cozy with the RStudio environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy your data (data wrangling), subset and merge data, and generate descriptive statistics. Next will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into a format where you can analyse it. After that, we will make all sorts of plots for both data exploration and publication. Lastly, we will learn to write customized functions and apply more advanced statistical tests, which can really save you time and help scale up your analyses.
This is the final lecture in a series of seven. Last lecture we explored the realm of statistical analysis with linear regression and other general linear models. Now we arrive at the final destination: control flow, covering how to create looping and branching code as well as how to write our own functions.
Grey background: command-line code, R library and function names... fill in the code here if you are coding along.

Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need your UTORid credentials to log in. From there you will find each week's lecture files in the directory /2021-09-IntroR/Lecture_XX, including a partially coded skeleton.ipynb file as well as all of the data files needed to run the week's lecture.
Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.
A live lecture version will be available at camok.github.io and will update as the lecture progresses. Be sure to refresh the page if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus. A recorded version of the lecture will be made available through the University's MyMedia website and a link will be posted in the Discussion section of Quercus.
Today we'll be keeping it simple by working with a dataset to help us demonstrate the power of looping and user-defined functions.
We'll be working with this dataset to help us work through the different aspects of control flow.
We'll be using this source file later to show how you can save your own functions and import them for data analysis.
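As a preview of that workflow, here is a minimal sketch of saving a function to a .R script and loading it back with source(). The file name my_functions.R and the function double_it() are hypothetical examples, not the lesson's actual source file:

```r
# Write a one-line function definition to a script file
# (my_functions.R and double_it() are made-up names for illustration)
writeLines("double_it <- function(x) x * 2", "my_functions.R")

# source() runs the script, defining double_it() in our current session
source("my_functions.R")
double_it(21)   # returns 42
```

In practice you would keep your real analysis functions in one such script and source() it at the top of each notebook.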
The following packages are used in this lesson:
tidyverse (tidyverse installs several packages for you, like dplyr, readr, readxl, tibble, and ggplot2). In particular we will be taking advantage of the stringr package this week.
viridis, our colour-blind-friendly package for providing colour palettes to our visualizations
Some of these packages should already be installed into your Anaconda base from previous lectures. If not, please review that lesson and load these packages. Remember to please install these packages from the conda-forge channel of Anaconda.
conda install -c conda-forge r-tidyverse
conda install -c conda-forge r-viridis
#--------- Install packages for today's session ----------#
# install.packages("tidyverse", dependencies = TRUE) # This package should already be installed on Jupyter Hub
#--------- Load packages for today's session ----------#
library(tidyverse)
library(viridis)
Warning message:
"package 'tidyverse' was built under R version 4.0.5"
-- Attaching packages --------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.3     v purrr   0.3.4
v tibble  3.1.1     v dplyr   1.0.6
v tidyr   1.1.3     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.1
Warning message:
"package 'ggplot2' was built under R version 4.0.5"
Warning message:
"package 'tibble' was built under R version 4.0.5"
Warning message:
"package 'tidyr' was built under R version 4.0.5"
Warning message:
"package 'dplyr' was built under R version 4.0.5"
Warning message:
"package 'forcats' was built under R version 4.0.5"
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Loading required package: viridisLite
Although we have only briefly touched on some aspects of control flow, it has been at work behind the scenes in many of the functions you've used throughout this course. From your experience in Jupyter Notebooks, the order in which a code cell's individual statements or instructions are executed can be considered part of control flow. Expanding on this idea, the number order of the code cells indicates the control flow of the entire notebook or program. Once a code cell is run, the objects it has generated remain stored in memory and available for access.
Within our code cells and overall program, control flow involves statements that create loops, evaluate conditions, and direct movement through the program. These statements allow us to run different blocks of code at different times.
In this lecture, we'll touch on all of these concepts to give you a taste of how you can make your programs accomplish more with less actual code. Let's start by loading up an example dataset to play around with.
# set working directory
getwd()
list.files("./data")
# read our file in with read_csv()
# compounds_data <- read.csv("data/compounds_stats.csv", header = TRUE,
# stringsAsFactors = FALSE, check.names = FALSE)
compounds_data <- read_csv("data/compounds_stats.csv",
col_names = TRUE)
# explore our loaded data frame
head(compounds_data)
-- Column specification --------------------------------------------------------
cols(
  compound = col_character(),
  salinity = col_character(),
  group = col_character(),
  day = col_double(),
  mean_methane = col_double(),
  sd_methane = col_double()
)
| compound | salinity | group | day | mean_methane | sd_methane |
|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
| substrate_free | brackish | media_only | 1 | 0 | 0 |
| substrate_free | fresh | media_only | 1 | 0 | 0 |
| substrate_free | saline | media_only | 1 | 0 | 0 |
| benzene | brackish | sterile | 1 | 0 | 0 |
| hexane | brackish | sterile | 1 | 0 | 0 |
| toluene | brackish | sterile | 1 | 0 | 0 |
for() loops to repeat commands for a maximum number of iterations

R doesn't care if you write the same code 1000 times or have the interpreter repeat a single copy 1000 times. However, the second is a lot easier for you. The for() loop helps to reduce code replication by compartmentalizing a set of instructions to repeat, instead of copying and pasting the same code several times.
More specifically, a for() loop executes a statement repeatedly until a well-defined endpoint: the loop stops once a specific variable's value is no longer contained in a given sequence.
For example, let's say that we want to add 2 to a ten times, overwriting it every time:
# Increment a by 2, the bad way...
a <- 2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a
Sure, 10 times is doable by hand - just copy and paste. But what if you need to perform that same task, say, 1,000 times? What if the code were more complex than a <- a + 2? That is when for() loops come to the rescue.
# Increment 'anything' using a for loop
anything = 10
# Set up your for loop with a 'tally' count
for (tally in 1:1000) {
anything <- anything + 2
}
tally
anything
A for loop can be described in three stages:

1. for(x in y): set a variable x equal to the next value in the sequence y
2. { code to run }: run a set of code with that variable at that value
3. Repeat until every value in the sequence has been used

There are a number of ways to set the counting variable within the for() initialization. In reality, you just need to supply a vector of elements for it to iterate through. This could be a sequence where y is defined as a:b, or a numeric vector, or even a vector of objects! Each of these values is assigned to x in turn and can be used inside our loop.
Note that without {...} enclosing your code, R will run only the first statement right after the for() call. This statement can be on the same line or on the next line; subsequent lines, regardless of indentation, will not be run as part of the loop. This behaviour lets you quickly build a simple for() loop, while the braces let you extend it to accomplish many more complex tasks.
Let's take a look at the seq() function and how you can use it within a for() loop.
# Use the seq() function
seq(from = 1, to = 10, by = 0.5)
# let's use seq() in a for loop to count, no braces but indentation
for(variable in seq(1,10, 0.5))
print(variable)
print("middle but not really")
print("This is the end")
[1] 1 [1] 1.5 [1] 2 [1] 2.5 [1] 3 [1] 3.5 [1] 4 [1] 4.5 [1] 5 [1] 5.5 [1] 6 [1] 6.5 [1] 7 [1] 7.5 [1] 8 [1] 8.5 [1] 9 [1] 9.5 [1] 10 [1] "middle but not really" [1] "This is the end"
# for loop on a single line
for(variable in seq(1, 10, 0.5)) print(variable); print("This is the end")
[1] 1 [1] 1.5 [1] 2 [1] 2.5 [1] 3 [1] 3.5 [1] 4 [1] 4.5 [1] 5 [1] 5.5 [1] 6 [1] 6.5 [1] 7 [1] 7.5 [1] 8 [1] 8.5 [1] 9 [1] 9.5 [1] 10 [1] "This is the end"
# for loop on a single line, with brackets
for(variable in seq(1, 10, 0.5)) {print(variable); print("This is the end")}
[1] 1 [1] "This is the end" [1] 1.5 [1] "This is the end" [1] 2 [1] "This is the end" [1] 2.5 [1] "This is the end" [1] 3 [1] "This is the end" [1] 3.5 [1] "This is the end" [1] 4 [1] "This is the end" [1] 4.5 [1] "This is the end" [1] 5 [1] "This is the end" [1] 5.5 [1] "This is the end" [1] 6 [1] "This is the end" [1] 6.5 [1] "This is the end" [1] 7 [1] "This is the end" [1] 7.5 [1] "This is the end" [1] 8 [1] "This is the end" [1] 8.5 [1] "This is the end" [1] 9 [1] "This is the end" [1] 9.5 [1] "This is the end" [1] 10 [1] "This is the end"
Many common functions are just for() loops under the hood

As was mentioned at the start of this section, under the hood, many of the functions that we commonly use are just for() loops. We could replicate them with explicit for() loops, but that takes extra coding time! For example, we can replicate the rep() function.
# Use the rep() function to print the number 1-5, 8 times
rep(x = 1:5, times = 8)
Let's duplicate the function of rep() with a for() loop!
# for loop version variables need to be set
rm(result)
x <- 1:5
n <- 8
result <- x # What happens if we remove this line?
# Build our for loop
for (i in 1:(n-1)){
result <- c(result, x)
print(result)
}
result
i
Warning message in rm(result): "object 'result' not found"
[1] 1 2 3 4 5 1 2 3 4 5 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 [39] 4 5
Why did we declare result <- x ahead of the for loop? It can get a little complicated, but for our purposes the offending issue lies within the loop body itself: result <- c(result, x). Remember, when the kernel encounters this command, it tries to evaluate the right-hand side of the assignment first. If result does not yet exist, the lookup fails and the assignment cannot complete. To avoid this, we need to declare result outside the loop.
There are a few ways we could do this such as with result <- NULL just so that it exists as an initialized placeholder. Instead we assigned it initially to hold the first iteration of our sequence. Either would have worked but would require different numbers of loop iterations.
If you declared result <- NULL or result <- x within the loop, the command would repeat with every iteration, overwriting result back to its initial state each time. Nothing would progress! We'll use this concept as a springboard into the idea of scope.
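To make the initialization options concrete, here is a small sketch of the result <- NULL variant: because result starts empty, the loop needs the full n iterations rather than n - 1.

```r
# Replicating rep(1:5, times = 8) with a NULL-initialized accumulator
x <- 1:5
n <- 8
result <- NULL              # exists as an empty placeholder before the loop
for (i in 1:n) {
  result <- c(result, x)    # append one copy of x per iteration
}
identical(result, rep(1:5, times = 8))   # TRUE
```

Either initialization works; the choice just shifts how many times the loop must run.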
Control flow statements, like other compartmentalized sections of code, can be thought of as separate rooms in a house or sandboxes in a playground.
Thus a variable is either global or local in scope. If it is local, the information about it simply disappears at the end of the function or control flow block. The scope of a variable can usually be considered as lying between the {...} of a programming section. After you've left that section, anything explicitly declared within it (i.e., new variables from that section) will be released from memory. Of course, R doesn't exactly play by those rules, and stray variables can float around in memory. If you want to ensure that variables from something like a for loop remain local, you can use the local() command or create a function().
Why is scope important?
Understanding this concept will save you a lot of trouble down the road as you build more and more complex programs. You'll learn to avoid declaring variables in the wrong place, or trying to access ones that no longer exist in your scope. Let's revisit our example from above.
# for loop version variables need to be set
rm(result, j, i)
x <- 1:5
n <- 8
result <- 100
# Build a local for loop
local(
for (i in 1:n){
result <- c(result, x)
print(result)
j = result # assign a value to j
}
)
result
i
j
Warning message in rm(result, j, i): "object 'j' not found"
[1] 100 1 2 3 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 [20] 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 [20] 4 5 1 2 3 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 [20] 4 5 1 2 3 4 5 1 2 3 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 [20] 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 [20] 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 [39] 3 4 5
Error in eval(expr, envir, enclos): object 'i' not found Traceback:
local() scope isolates your code from the global environment

What happened to our variable result? It was initially declared with the value 100. When we entered the local() scope and ran the first iteration of our for() loop, the code result <- c(result, x) looked locally first for the values of result and x; since these variables did not exist locally, it pulled the values from the global environment. A local result variable was then declared and assigned a value. This local version of result was updated with each iteration, but the global version was never altered.
A similar effect is seen when creating and using your own functions (to be discussed) but you can see that the kernel searches for variables (and functions) in the local namespace before checking the global namespace, followed by the namespaces of the loaded packages.
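A tiny sketch of that lookup order: a name defined inside a function masks the global one while the function runs, and the function's own variables never reach the global environment. The names f and y here are hypothetical, not part of the lesson's code:

```r
y <- "global"
f <- function() {
  y <- "local"   # masks the global y inside the function only
  y
}
f()              # returns "local"
y                # still "global": the function's y never escaped its scope
```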
Cycling through values with a for() loop

The most useful thing to do with a for loop is to cycle through values. Let's return to compounds_data and incrementally plot mean methane per day, building the plot with ggplot.
# Pull down the structure and colnames of our compounds_data
str(compounds_data)
colnames(compounds_data)
spec_tbl_df[,6] [480 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ compound    : chr [1:480] "substrate_free" "substrate_free" "substrate_free" "benzene" ...
 $ salinity    : chr [1:480] "brackish" "fresh" "saline" "brackish" ...
 $ group       : chr [1:480] "media_only" "media_only" "media_only" "sterile" ...
 $ day         : num [1:480] 1 1 1 1 1 1 1 1 1 1 ...
 $ mean_methane: num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
 $ sd_methane  : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   compound = col_character(),
  ..   salinity = col_character(),
  ..   group = col_character(),
  ..   day = col_double(),
  ..   mean_methane = col_double(),
  ..   sd_methane = col_double()
  .. )
# grab the list of days from our dataset
days = unique(compounds_data$day)
# Create a for loop to go through and incrementally print portions of the data
for(i in seq(4, length(days), by=4)) {
# Create a plot
plot <-
compounds_data %>%
# Filter based on the days in the set
filter(day %in% days[1:i]) %>%
# Build a plot with mean_methane vs day
ggplot(.) +
# 2. Aesthetics
aes(x = day, y = mean_methane, colour = day) +
labs(title = paste0("Mean methane for day 1 to ", i)) + # Add a title based on the day range
guides(colour = "none") +
# 3. Scaling
# Scale the colour scheme displayed based on the total date range
scale_colour_viridis(option = "magma", begin = 0, end = (1/length(days)*i)) +
xlim(0, 560) + # Preset the x and y limits
ylim(0, 200) +
# 4. Geoms
geom_point()
suppressWarnings(print(plot)) # Drop the warnings when we print the plot
Sys.sleep(2) # Pause the system for 2 seconds
}
# What is the total number of rows in our data?
total_rows = nrow(compounds_data)
# Create a for loop and add rows of data to the plot incrementally
for(i in seq(total_rows/4, total_rows, by=total_rows/4)) {
# Create a plot
plot <-
# Build a plot with mean_methane vs day
ggplot(compounds_data[1:i,]) +
# 2. Aesthetics
aes(x = day, y = mean_methane, colour = day) +
labs(title = paste0("Mean methane for day 1 to ", i)) + # Add a title based on the day range
guides(colour = "none") +
# 3. Scaling
# Scale the colour scheme displayed based on the total date range
scale_colour_viridis(option = "magma") +
xlim(0, 560) + # Preset the x and y limits
ylim(0, 200) +
# 4. Geoms
geom_point()
suppressWarnings(print(plot)) # Drop the warnings when we print the plot
Sys.sleep(2) # Pause the system for 2 seconds
}
Iterating directly through a vector with a for() loop

Another handy feature of the for() loop in R is being able to give the loop a vector to iterate through directly until there are no elements left. This comes in handy when applying the same transformations, functions, or calculations to different subsets or elements within a vector.
We'll start with a simple example of looping through a small character vector.
# for loop over a character vector, with brackets
for(variable in c("I", "You", "We all")) {
print(variable)
print("scream;")
}
print ("for ice cream")
[1] "I" [1] "scream;" [1] "You" [1] "scream;" [1] "We all" [1] "scream;" [1] "for ice cream"
Let's use t.test() to look for methane production differences between the salinity levels on every day ending in a 2 (starting at day 2), excluding the "brackish" salinity group.
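Before filling in the filter, here is a quick sketch (on made-up day values) of how the modulo operator can express "ends in a 2":

```r
# %% returns the remainder after division, so a day ends in 2
# exactly when day %% 10 equals 2
days <- c(2, 12, 15, 22, 112)
days %% 10          # 2 2 5 2 2
days %% 10 == 2     # TRUE TRUE FALSE TRUE TRUE
```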
# subset dataset every day that ends in a 2 and exclude brackish
# %% is a quick math symbol for modulo (return a remainder)
subdata <- compounds_data %>%
filter(...,
salinity != "brackish")
# create an empty data frame to store the output of the for loop
result <- ...(day = unique(subdata$day),
difference = NA,
p_value = NA)
result
# for loop to calculate difference in means between fresh vs saline groups in salinity
for(i in ...) {
# Generate a t-test on subset by day
t <- t.test(mean_methane ~ salinity, subdata[subdata$day == i, ])
# write the results to our data frame
result[result$day == i, "difference"] <- diff(t$estimate)
result[result$day == i, "p_value"] <- t$p.value
}
# Show the final updated data frame
result
Replicating tapply()

Let's replicate the tapply() function, although we don't need our version to have the same formatting. tapply() applies "a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors". We'll use it on our dataset compounds_data and save the result in compound_mean.
# take a look at the apply family of functions in R
?tapply
head(compounds_data)
# use tapply to get the mean_methane value when grouping by compound
tapply(compounds_data$...,
compounds_data$...,
mean,
na.rm = TRUE)
Now that we see how tapply() works let's mimic this function using a for loop. Note that tapply() returned an array to us but we'll save our results in a data frame since we're more familiar with these.
# mimic tapply using for loop
# First build a data frame to hold our result
# Each row will represent a unique compound
compound_mean <- data.frame("compound" = unique(compounds_data$compound),
"mean_methane" = NA)
for(i in ...) {
# iterate through compound names
curr_compound <- compound_mean[...] # this could also be written as compound_mean$compound[i]
# calculate the mean of all samples that share the same compound name.
compound_mean[...] <- mean(compounds_data$mean_methane[compounds_data$compound == curr_compound],
na.rm = TRUE)
}
# Print our result
compound_mean
# mimic tapply using for loop
# Let's make it using a named vector
# Each row will represent a unique compound
compound_mean_vector = vector()
# iterate directly through the compound names
for(i in unique(compounds_data$compound)) {
# add their mean to the vector
compound_mean_vector <- c(compound_mean_vector, # Add to our vector
setNames(mean(compounds_data$mean_methane[compounds_data$compound == i],
na.rm = TRUE), # Calculate the mean
i) # Set the element name to the compound name
)
}
# Print our result but also sort it based on the element names
compound_mean_vector[order(names(compound_mean_vector))]
Making decisions with if() statements

One of the big advantages of programming is having conditional statements in your code. R can make binary decisions like "if the data meet a condition, do this". Some of these happen implicitly, as in a for() loop, but you can also declare these decision branches explicitly.
The if() conditional evaluates an expression that is either TRUE or FALSE. The general format is
if (boolean_expression) {
  # statement(s) will execute if the boolean expression is TRUE.
}
# Practice with an if() statement
x <- c("what", "is", "truth")
if(...) {
print("Truth is found")
}
The else and else if statements

Now that we know how to use if() statements, what if we want to give a second instruction based on the outcome of the if() statement? The else and else if statements exist to extend the conditional branch with additional considerations. In general, the structure looks like this:
if (boolean_expression_1) {
  # statement(s) will execute if boolean expression 1 is TRUE.
} else if (boolean_expression_2) {
  # statement(s) will execute if boolean expression 2 is TRUE.
} else {
  # statement(s) will execute if none of the above boolean expressions were TRUE.
}
# Practice with a complex if() statement
x <- c("what", "is", "truth")
# Build a complex cascade of statements looking for Truth
if("TRUTH" %in% x) {
print("TRUTH is found")
} ... ("Truth" %in% x) {
print ("Truth is found")
} ... { # notice the placement of else is directly after the closing }
print ("The truth is out there somewhere")
}
Using if() statements to generate system messages

If/else statements can also be used to perform system-wide tasks, like generating a warning or halting execution. For example, if we are writing a file to a directory and there is already a file with the same name, we should generate a warning or simply stop. Without the warning, the existing file will be silently overwritten.
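R also has built-in warning() and stop() functions for exactly this kind of message: warning() flags a problem but lets execution continue, while stop() halts with an error. A hypothetical sketch (check_overwrite() is a made-up helper, not part of the lesson code):

```r
# stop() halts execution with an error when the file already exists;
# otherwise we return the go-ahead message
check_overwrite <- function(fname) {
  if (file.exists(fname)) {
    stop("Stop! A file with that same name already exists")
  }
  "No files with the same name. Good to go!"
}

check_overwrite("some_file_that_does_not_exist.csv")
```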
# Check if our file exists
# Use dir() to return a vector of file names and then ask if any match ours.
if(sum(dir() == "Methane_mean_by_compound.csv") > 0) {
print("Stop! A file with that same name already exists")
} else {
# The file does not exist, print the go-ahead and save the file
print("No files with the same name. Good to go!")
}
write_csv(x = compound_mean, file = "Methane_mean_by_compound.csv", col_names = TRUE)
Challenge: Is there a cleaner way to produce our conditional?
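One candidate worth considering (offered as a hint, not the official answer): file.exists() returns a single TRUE/FALSE directly, replacing the dir() plus sum() pattern:

```r
# file.exists() checks for the file directly and returns TRUE or FALSE,
# equivalent in spirit to: sum(dir() == "Methane_mean_by_compound.csv") > 0
file.exists("Methane_mean_by_compound.csv")
```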
Despite the warning in our code, the file in our example would still be overwritten. The call to write_csv() is outside the control flow of the conditional if()/else. To fulfill our true intentions, we should move the write_csv() call so that it is under the direct influence of the control flow.
# Check if our file exists
if(...) {
print("Stop! A file with that same name already exists")
} else {
# The file does not exist, print the go-ahead and save the file
print("No files with the same name. Good to go!")
# Write the file as part of the same control statement
...
}
ifelse() is an effective control flow statement for simple tasks

As we've seen a couple of times in lecture now, rather than building a large control flow block for simple tasks, we can use the ifelse() function to contain the conditional test and both outcomes in a single call. This is a much more powerful function than it appears, as you can supply vectors to it as well!
ifelse(boolean_expression_vector, true_outcome_vector, false_outcome_vector)
Watch out for vector recycling! It's convenient for re-assigning values across vectors, but note that we aren't performing any complex actions or responses - just assigning outcomes/values based on our test expression.
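For instance, a vectorized call on a made-up vector of scores evaluates the test element-wise and returns one outcome per element:

```r
# ifelse() tests each element and picks from yes/no element-wise
scores <- c(3, 7, 2, 9)
ifelse(scores < 5, "low", "high")   # "low" "high" "low" "high"
```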
# A simple example of ifelse()
rm(a)
i <- 8
ifelse(test = i < 5, yes = a <- 0, no = a <- 1)
a
# A complex vectorized example of ifelse()
i <- ...
ifelse(test = i < 5, yes = 0, no = 1) # Can we achieve this in a simpler way?
# Don't forget that we can quickly convert booleans to numeric!
...
There may be instances where you need to run loops on data until you find a certain piece of information, or until a specific condition is met, rather than examining all of the elements within a set. There are two ways to accomplish these "open-ended" loops.
while() loops run conditionally

Unlike for() loops, which execute until a specified number of iterations is reached, the while() loop executes a command as long as a conditional expression evaluates as TRUE at each iteration. The conditional expression must also evaluate as TRUE for execution to begin at all. The while() loop can be thought of as a special implementation of an if() statement that repeats over and over again until the conditional fails.
Let's work with some simple examples.
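As a warm-up before we build our own below, here is one complete while() loop on made-up values: it halves x until x drops below 1, counting the steps along the way.

```r
# Halve x until it falls below 1, counting how many halvings it takes
x <- 40
steps <- 0
while (x >= 1) {
  x <- x / 2            # the statement that moves us toward the exit condition
  steps <- steps + 1
}
steps   # 6 halvings: 20, 10, 5, 2.5, 1.25, 0.625
```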
# Initialize our variable for conditional assessment
x <- 0
# Generate the while loop, incrementing x by 1 on each iteration, as long as x < 10
while(...) {
x <- x + 1
print(x)
}
# Loop will be ignored if the condition is FALSE and nothing gets printed
x <- ...
while(x < 10) {
x <- x + 1
print(x)
}
When programming a conditional loop you must always include a statement that alters the condition or breaks out of the loop itself (coming up). It's also important to note the order and placement of where you alter the condition in your loops. All the statements within the loop body, unless otherwise specified, will execute before the conditional is re-evaluated.
For example, a programmer is assigned a task: "While you're at the grocery store, buy some eggs". The programmer never came back home.
# Set your initial value
programmer <- " at the grocery store"
# Build your while loop
while(programmer == " at the grocery store") {
print("buy some eggs")
programmer <- ...
}
print(programmer)
# When do we provide the opportunity to change?
Using next and break to exit any kind of looping structure

The explicit use of the next and break commands will break free from the current looping structure, but they differ in what they do afterwards:

- The next command exits the current iteration of the loop but returns to run the next iteration.
- The break command completely exits the loop structure, as if it had reached its natural end.

Let's use the following examples to see how these mechanisms work.
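A combined sketch, using values different from the exercises below: next skips the odd numbers, and break stops the loop entirely once we pass 7.

```r
out <- c()
for (i in 1:10) {
  if (i %% 2 == 1) next   # skip odd numbers, move to the next iteration
  if (i > 7) break        # leave the loop completely once past 7
  out <- c(out, i)
}
out   # 2 4 6
```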
# using next within our for loop
for(i in 1:10) {
if (i >= 5 & i <= 8) {
... # skips the next iteration of the loop
}
print(i)
}
i
# Using break
for(i in 1:10) {
if (i == 5) {
... # completely exits the loop
}
print(i)
}
i
repeat loops run endlessly unless specifically interrupted by break

Unlike the while loop, which can end when its condition fails, a repeat loop has no explicit conditional statement built into its formation. Instead, it will continue to repeat until it is broken out of by the break command.
# Using repeat() to endlessly loop
i = 1
repeat {
if (i == 20) {
break # completely exits the loop
}
print(i)
i = i + 1
}
i
Depending on the order in which you set up your conditionals, you may accidentally produce unexpected issues. It is best to consider the order in which you want to accomplish tasks within your loop before beginning the next iteration. This is especially relevant for a conditional loop (while() or repeat), where you must include a variable that can eventually meet the conditions for exit.
# Using repeat() to demonstrate that conditional placement matters.
i = 1
# What numbers will this code print?
# What happens if we move the print command around?
repeat {
...
if (i == 20) {
break # completely exits the loop
}
print(i)
}
Depending on the task you are working on, there may already be a function that satisfies your need, so you don't have to write explicit for() loops. Make use of existing functions whenever you can, because they have already been optimized to be fast and efficient.
Taking advantage of built-in functions keeps your code clean, rather than programming for loops to generate a simple number pattern.
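For example, seq() and rep() (both used earlier in this lecture) generate common number patterns in one call, with no loop required:

```r
# One call each, instead of a hand-rolled for() loop
seq(0, 20, by = 5)          # 0 5 10 15 20
rep(c("a", "b"), each = 3)  # "a" "a" "a" "b" "b" "b"
```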
for() loops with ggplot()

Let's say we are now ready to start making some plots for our manuscript, and we want to make individual plots for each salinity. The code below makes one plot per salinity level, i.e. brackish, fresh, or saline, depending on which salinity you pass to the code.
library(repr)
# Note that the standard display size is a 7x7 inch space. let's double the width
options(repr.plot.width=21, repr.plot.height=7)
# An alternate way to create a new column in your data frame
# compounds_data$unique_identifier <- paste(compounds_data$salinity,
# compounds_data$compound,
# compounds_data$group,
# sep = "_")
# Use dplyr to help you do it instead! The code is slightly simpler
compounds_data <- compounds_data %>%
mutate(... = paste(salinity, compound, group, sep="_"))
head(compounds_data)
# we want one plot per salinity
# 1. Data
ggplot(compounds_data[compounds_data$salinity == "fresh", ]) +
# 2. Aesthetics
aes(x = ..., y= ...) +
# 4. Geoms
geom_line(aes(color = unique_identifier))
But what if I were to have, say, 25 salinity levels? In this case, a for loop will be the way to go. Take a look at the following code:
# instead of repeating that code three times, once for each salinity,
# we can just use a for loop that generates and writes one plot per salinity
# Loop through the possible types of salinity
for (...){
salinity_plot <-
# 1. Data
ggplot(compounds_data[compounds_data$salinity == i, ]) +
# 2. Aesthetics
aes(x = day, y = mean_methane) +
theme_grey() +
ggtitle("Mean methane counts per day") + # plot title
# 3. Scaling
scale_y_continuous("Methane (umoles)", limits=c(0, 200)) +
scale_x_continuous("Time (days)") +
# 4. Geoms
geom_line(aes(color = unique_identifier)) +
geom_point()
print(salinity_plot) # The only way to see the plot is to print it within a for loop
# Save each plot as it's generated
ggsave(plot = salinity_plot, filename = paste(i, "graph.png", sep = "_"), path = "data/" ,
scale=2, device = "png", units = c("cm"))
}
From the above you can see that we can take advantage of the incrementing variable within the for loop: we can use it to help subset data and to generate titles and file names. You can use it in combination with other control statements to update the image as well! Just remember to avoid generating errors within your for() loop when accessing or altering data. Ensure you aren't trying to reference or alter data or subsets that do not exist due to missing information in your original datasets.
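As a hedged, self-contained sketch (using a hypothetical toy data frame in place of compounds_data), one way to guard against subsets that do not exist:

```r
# Toy stand-in for compounds_data; "brackish" is deliberately absent
toy = data.frame(salinity = c("fresh", "fresh", "saline"),
                 mean_methane = c(10, 12, 50))
for (i in c("fresh", "brackish", "saline")) {
  subset_df = toy[toy$salinity == i, ]
  if (nrow(subset_df) == 0) {
    message("No rows for salinity '", i, "' - skipping")
    next   # avoid building a plot from an empty subset
  }
  # ... build and save the plot for subset_df here ...
}
```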
What if I want one plot for each compound and its sterile control in one page by salinity?
# instead of repeating that code three times, once for each salinity, we can just use a for loop that generates and writes one plot per salinity
for (i in unique(compounds_data$salinity)){
salinity_plot <-
# 1. Data
ggplot(compounds_data[compounds_data$salinity == i, ]) +
# 2. Aesthetics
aes(x = day, y = mean_methane) +
theme_grey() +
theme(panel.spacing = unit(2, "lines")) +
theme(strip.text = element_text(size=12, face="bold")) + #controls facet_wrap's title
ggtitle(paste("Mean methane counts per day based on ", ...,
" growth condition grouped by compound" , sep='')) + # plot title
# 3. Scaling
scale_y_continuous("Methane (umoles)", limits=c(0, 200)) +
scale_x_continuous("Time (days)") +
# 4. Geoms
geom_line(aes(color = unique_identifier)) +
geom_point() +
# 6. Facets
facet_wrap( ~ ... , ncol=2, nrow = 4, scales = 'free') # how many rows and columns for facet
# Only print the brackish dataset
if (i == "brackish") {
print(salinity_plot) # The only way to see the plot is to print it within a for loop
}
# Save all of the data regardless of set
ggsave(plot = salinity_plot, filename = paste(i, "graph.facet.png", sep = "_"), path = "data/" ,
scale=2, device = "png", units = c("cm"))
}
Yes! So far we've covered many forms of control flow, but all of our programs have moved in a linear direction from start to end. That is partly a consequence of working in a Jupyter notebook; programs, however, are not necessarily run in a linear fashion.
What if you need to perform a set of similar instructions multiple times, at multiple points within your control flow? Perhaps it's even the same kind of for() loop on different sets of data? There are tricks like nested loops, but you're better off knowing how to write functions that can be reused in other code as well!
The general structure of a script or program can be divided into
A best practice when writing functions is the "Do One Thing" principle: each function should do one thing, one task. Instead of one big function, you can write several small ones, one per task, without going to the other extreme of fragmenting your code into a ridiculous number of snippets. By doing one thing, your functions become easier to read, test, debug, and reuse.
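As a small illustrative sketch (with hypothetical function names): the same work written as one multi-task function versus two single-task functions.

```r
# One function doing two tasks at once...
clean_and_average = function(x) {
  x = x[!is.na(x)]      # task 1: drop missing values
  round(mean(x), 2)     # task 2: summarize
}
# ...split so each function does one thing
drop_missing = function(x) x[!is.na(x)]
mean_rounded = function(x, digits = 2) round(mean(x), digits)

mean_rounded(drop_missing(c(1, 2, NA, 4)))   # 2.33
```

The split versions can now be reused and tested independently.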
Time to start writing our own functions.
While we have been using help() and ? to look up documentation for the functions we've been using, our user-defined functions will not have any such accessible documentation. (Of course, if we were building a proper R package, we could create it.)
Regardless, it is best practice to document your functions much like you document the rest of your code. In this case you can include information such as a description of what the function does, its inputs (parameters and their expected types), and its outputs (return values and side effects such as saved files).
function()¶In R, a function is declared with the following syntax:
function_name = function(parameter1_name, parameter2_name, ..., parameterN_name = preset_value) {
# The specific code of your function goes within the {...}
return(output)
}
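For instance, a minimal (hypothetical) function using this syntax, with one required parameter and one preset value:

```r
# Convert Celsius to Fahrenheit, rounding to a default of 1 digit
celsius_to_fahrenheit = function(temp_c, digits = 1) {
  result = temp_c * 9/5 + 32
  return(round(result, digits))
}
celsius_to_fahrenheit(37)      # 98.6
celsius_to_fahrenheit(37, 0)   # 99
```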
Let's convert our plotting code from above into a simple function!
# Description: This function, given a set of data from the compounds_data format will produce
# a faceted series of line plots for a specific salinity and its associated compounds
# Input:
# data.df: a data frame at least with the following column names
# $salinity, $mean_methane, $unique_identifier, $compound
# salinity.val: a character/string used to subset data.df by filtering the $salinity column
# Output: make.facet.plot will generate a facet plot from data.df based on the salinity.val
# The plot will be saved to a file ending in "graph.facet.function.png"
make.facet.plot = function(...) {
# Make the plot
salinity_plot <-
# 1. Data
ggplot(data.df[data.df$salinity == salinity.val, ]) +
# 2. Aesthetics
aes(x = day, y = mean_methane) +
theme_grey() +
theme(panel.spacing = unit(2, "lines")) +
theme(strip.text = element_text(size=12, face="bold")) + #controls facet_wrap's title
ggtitle(paste("Mean methane counts per day based on ", salinity.val,
" growth condition grouped by compound" , sep='')) + # plot title
# 3. Scaling
scale_y_continuous("Methane (umoles)", limits=c(0, 200)) +
scale_x_continuous("Time (days)") +
# 4. Geoms
geom_line(aes(color = unique_identifier)) +
geom_point() +
# 6. Facets
facet_wrap( ~ salinity + compound, ncol=2, nrow = 4, scales = 'free') # how many rows and columns for facet
# print the plot
print(salinity_plot) # The only way to see the plot is to print it within a for loop
# save the plot
ggsave(plot = salinity_plot, filename = paste(salinity.val, "graph.facet.function.png", sep = "_"), path = "data/" ,
scale=2, device = "png", units = c("cm"))
}
Now that our subroutine is stored in memory, it can be called whenever we want, even on different data sets, as long as they meet the requirements set out in our description of the function. You can even build upon it, using control flow to decide whether the plot will be faceted or not. The code between the two versions is so similar that you could fold the difference into an if statement.
Let's try to use it right now.
# Use a for loop to iterate through salinity categories
for (i in unique(compounds_data$salinity)){
# Call on our function now
make.facet.plot(...)
}
return() command¶Some of your functions may generate subsets of data or results that you would like to investigate further. For example, when we generate our plots, perhaps we would also like to retrieve information such as where the file was saved, along with the subset of data used for each plot.
Using the return() command has two consequences: it immediately exits the function at that point, and it hands the specified object back to the caller.
A special note about the returned object. This can be any kind of object and if you want to return multiple objects, put them in a list()! Let's update our function.
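Before updating it, here's a small self-contained sketch of returning multiple objects in a list():

```r
# Return a count, a mean, and a range all at once
summarize_vector = function(x) {
  return(list(n = length(x),
              mean = mean(x),
              range = range(x)))
}
result = summarize_vector(c(2, 4, 6, 8))
result$n       # 4
result$mean    # 5
result$range   # 2 8
```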
# Description: This function, given a data frame in the compounds_data format, can produce a salinity plot
# that is faceted by compound type and save it to file.
# Input:
# data.df: a data frame at least with the following column names
# $salinity, $mean_methane, $unique_identifier, $compound
# salinity.val: a character/string used to subset data.df by filtering the $salinity column
# Output: make.facet.plot will generate a facet plot from data.df based on the salinity.val
# The plot will be saved to a file ending in "graph.facet.function.png"
# It will return
# [1] subset data
# [2] ggplot object
# [3] save plot filename
save.facet.plot = function(data.df, salinity.val) {
# Make the data subset
salinity.data <- data.df %>% filter(salinity == ...)
# Make the salinity plot
salinity_plot <-
# 1. Data
ggplot(salinity.data) +
# 2. Aesthetics
aes(x = day, y = mean_methane) +
theme_grey() +
theme(panel.spacing = unit(2, "lines")) +
theme(strip.text = element_text(size=12, face="bold")) + #controls facet_wrap's title
ggtitle(paste("Mean methane counts per day based on ", salinity.val,
" growth condition grouped by compound" , sep='')) + # plot title
# 3. Scaling
scale_y_continuous("Methane (umoles)", limits=c(0, 200)) +
scale_x_continuous("Time (days)") +
# 4. Geoms
geom_line(aes(color = unique_identifier)) +
geom_point() +
# 6. Facets
facet_wrap( ~ salinity + compound, ncol=2, nrow = 4, scales = 'free') # how many rows and columns for facet
# print the plot
# print(salinity_plot) # The only way to see the plot is to print it within a for loop
# generate a save file name
save.file = paste(salinity.val, "graph.facet.function.png", sep = "_")
# save the plot
ggsave(plot = salinity_plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(...))
}
# Call on save.facet.plot function now
brackish.plot <- save.facet.plot(compounds_data, "...")
# Look at the data
head(brackish.plot ...)
# Display the plot to output
brackish.plot[[2]]
# What's the file name?
brackish.plot[[3]]
One of the things you can do as your functions and needs become more complex is to nest functions within other functions. We've already applied this when we call ggplot() functions within save.facet.plot().
The last helpful part of making functions is to consider providing default values for some of your arguments. In some cases you may have a subset of datasets that need to be treated differently, so including an argument that lets your function toggle certain behaviours is helpful. Including these arguments, however, means you have to define them every time you call the function, unless you assign a default value. Default values are overridden only by supplied arguments; otherwise they are applied within your function.
Let's update our save.facet.plot() one last time to include a default salinity.
# Update the arguments to have a default value
save.facet.plot = function(data.df, salinity.val = "...") {
# Make the data subset
salinity.data <- data.df %>% filter(salinity == salinity.val)
# Make the salinity plot
salinity_plot <-
# 1. Data
ggplot(salinity.data) +
# 2. Aesthetics
aes(x = day, y = mean_methane) +
theme_grey() +
theme(panel.spacing = unit(2, "lines")) +
theme(strip.text = element_text(size=12, face="bold")) + #controls facet_wrap's title
ggtitle(paste("Mean methane counts per day based on ", salinity.val,
" growth condition grouped by compound" , sep='')) + # plot title
# 3. Scaling
scale_y_continuous("Methane (umoles)", limits=c(0, 200)) +
scale_x_continuous("Time (days)") +
# 4. Geoms
geom_line(aes(color = unique_identifier)) +
geom_point() +
# 6. Facets
facet_wrap( ~ salinity + compound, ncol=2, nrow = 4, scales = 'free') # how many rows and columns for facet
# print the plot
# print(salinity_plot) # The only way to see the plot is to print it within a for loop
# generate a save file name
save.file = paste(salinity.val, "graph.facet.function.png", sep = "_")
# save the plot
ggsave(plot = salinity_plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(salinity.data, salinity_plot, save.file))
}
# rerun our function without a salinity type
brackish.plot <- save.facet.plot(compounds_data, "...")
brackish.plot[[2]]
While a rarer occurrence, your user-defined functions can be used to create and return a function itself. In these cases the scoping of your variables can become a little trickier, but variables within the inner function can be set using parameters from the outer function.
Let's start with a simple example before we return to our plot-saving function.
# Define our function(s)
make.power = function(...) { # This sets the variable values (via lexical scoping) of the exponent
pow = function(...) { # When we call on the resulting function it will require a base value
base^power # Make the actual calculation
}
}
# Define a new function that does cubic calculations
cube = make.power(...)
# Call on our cubic function using a base of 4
cube(...)
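Here is a complete, self-contained closure along the same lines (a hypothetical make.multiplier, so the fill-in exercise above stays yours to solve):

```r
# The outer call fixes 'factor' via lexical scoping;
# each returned function remembers its own copy
make.multiplier = function(factor) {
  function(x) {
    x * factor   # 'factor' comes from the enclosing environment
  }
}
double = make.multiplier(2)
triple = make.multiplier(3)
double(10)   # 20
triple(10)   # 30
```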
Now let's revisit our plot-saving function. We'll make a new plot-setting function that we can use to permanently set the data frame that is used when making plots. We can initialize this newly set function and save it as the function make.compound.plot().
# Step 1. Make our function inside a function
# Here we'll automatically set the dataframe to use for the plots
set.facet.plot = function(...) {
# Update the arguments to have a default value
save.facet.plot = function(salinity.val = "brackish") {
# Make the data subset
salinity.data <- data.df %>% filter(salinity == salinity.val)
# Make the salinity plot
salinity_plot <-
# 1. Data
ggplot(salinity.data) +
# 2. Aesthetics
aes(x = day, y = mean_methane) +
theme_grey() +
theme(panel.spacing = unit(2, "lines")) +
theme(strip.text = element_text(size=12, face="bold")) + #controls facet_wrap's title
ggtitle(paste("Mean methane counts per day based on ", salinity.val,
" growth condition grouped by compound" , sep='')) + # plot title
# 3. Scaling
scale_y_continuous("Methane (umoles)", limits=c(0, 200)) +
scale_x_continuous("Time (days)") +
# 4. Geoms
geom_line(aes(color = unique_identifier)) +
geom_point() +
# 6. Facets
facet_wrap( ~ salinity + compound, ncol=2, nrow = 4, scales = 'free') # how many rows and columns for facet
# print the plot
#print(salinity_plot) # The only way to see the plot is to print it within a for loop
# generate a save file name
save.file = paste(salinity.val, "graph.facet.function.png", sep = "_")
# save the plot
ggsave(plot = salinity_plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(salinity.data, salinity_plot, save.file))
} # End save.facet.plot
}
# Step 2. Make a function where the data set is the compounds_data
make.compound.plot <- set.facet.plot(...)
# Make a plot and filter it by saline
saline.results <- make.compound.plot("...")
saline.results[[2]]
stop() function exits a function with a message¶Sometimes you might write a function that could fail at a number of points for various reasons. While the R kernel may simply produce a warning and proceed, you may wish to stop the function right where it is rather than proceed. The stop() function can help produce "controlled" error stopping points in your program. You can also include an optional message that helps to clarify why the function was stopped.
First, however, let's produce a simple example of using the stop() function.
# Let's see what happens when we work with the log function
log10(1)
log10(0)
log10(...)
What if we aren't interested in producing -Inf or NaN values? We can build a wrapper around the log10 function with some conditional branching inside it.
get.log10 = function(x) {
if(x <= 0) ...
log10(x)
}
get.log10(1) # test our function
get.log10(-1) # Check it will stop when it's supposed to
get.log10(10) # Will this code run?
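The same pattern with a different function (a hypothetical sqrt wrapper), for comparison:

```r
# stop() halts with a clear message instead of letting sqrt()
# return NaN with a warning
get.sqrt = function(x) {
  if (x < 0) stop("x must be non-negative")
  sqrt(x)
}
get.sqrt(16)     # 4
# get.sqrt(-4)   # would halt with: Error ... x must be non-negative
```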
tryCatch() to identify errors without stopping¶In our example above, using stop() halts the execution of our code. Sometimes, instead, we may wish to note that an error has occurred but still proceed with the remainder of the code. In that case you can use the tryCatch() function, which takes a somewhat complex structure.
The tryCatch() function runs an expression (or lines of code) and, if an error or warning is produced, catches the result without halting your program's execution. Additional messages can be produced in each case so that the user is warned of potential issues. Using tryCatch() takes the form:
func_name = function(input) {
out <- tryCatch({ ## This is where we try code that might fail
expression(s) },
warning = function(condition) {
## statements to execute upon warning
message("Optional consolidated warning message")
return() # optional return value
},
error = function(condition) {
## statements to execute upon error
message("Optional consolidated error message")
return() # optional return value
},
finally = {
## Code to complete regardless of an error
}
) ## End of tryCatch
return(out)
}
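A concrete, runnable instance of this structure (a hypothetical safe_as_number helper) that catches the warning as.numeric() raises on non-numeric text:

```r
safe_as_number = function(input) {
  out <- tryCatch({
    as.numeric(input)   # warns "NAs introduced by coercion" on bad input
  },
  warning = function(condition) {
    message("Could not convert '", input, "' to a number; returning NA")
    return(NA)          # this value becomes the result of tryCatch
  },
  finally = {
    # runs whether or not a warning occurred
  })
  return(out)
}
safe_as_number("3.14")    # 3.14
safe_as_number("apple")   # NA, with a message
```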
Let's focus again on the plotting functions we produced. Previously, our versions of save.facet.plot() included steps where the input was being filtered, sometimes by sub-functions that should just be producing a plot object. To remedy this we'll go back to our rule of "Do One Thing" and write make.facet.plot() so that its sole purpose is to produce a plot when given salinity.data and a salinity.val.
# Simplify our main function which takes in pre-filtered data and plots it
make.facet.plot = function(salinity.data, salinity.val) {
# Make the salinity plot
salinity_plot <-
# 1. Data
ggplot(salinity.data) +
# 2. Aesthetics
aes(x = day, y = mean_methane) +
theme_grey() +
theme(panel.spacing = unit(2, "lines")) +
theme(strip.text = element_text(size=12, face="bold")) + #controls facet_wrap's title
ggtitle(paste("Mean methane counts per day based on ", salinity.val,
" growth condition grouped by compound" , sep='')) + # plot title
# 3. Scaling
scale_y_continuous("Methane (umoles)", limits=c(0, 200)) +
scale_x_continuous("Time (days)") +
# 4. Geoms
geom_line(aes(color = unique_identifier)) +
geom_point() +
# 6. Facets
facet_wrap( ~ salinity + compound, ncol=2, nrow = 4, scales = 'free') # how many rows and columns for facet
return(salinity_plot)
}
Next we'll write a second function that can filter a set like compounds_data, call make.facet.plot(), and save the results as needed. In doing so we simplify the debugging process, and it will help when we begin to incorporate a tryCatch() structure into our code.
# Make a function to filter data, make the plot, then save the plot
save.facet.plot = function(...) {
# filter the data
salinity.data <- filter(data.df, salinity == salinity.val)
# make the plotted data
salinity_plot <- make.facet.plot(salinity.data, salinity.val)
# generate a save file name
save.file = paste(salinity.val, "graph.facet.function.png", sep = "_")
# save the plot
ggsave(plot = salinity_plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(salinity.data, salinity_plot, save.file))
}
# save a facet plot but look at the output of the plot
save.facet.plot(compounds_data, "saline")[[2]]
Here's where we need to get creative. What would happen inside save.facet.plot() if we forgot to supply a salinity.val parameter in our call? Previously we included a default value like "brackish", but we have not done so here. A call like save.facet.plot(compounds_data) will produce an error.
# save a facet plot but but don't provide a salinity type
save.facet.plot(...)[[2]]
tryCatch() series to try and capture your error¶Instead of allowing execution to halt when we reach an error, we can produce some messages and return a placeholder value. In this implementation we will return NULL for the user to deal with.
# Make a function to filter data, make the plot, then save the plot
save.facet.plot = function(data.df, salinity.val) {
# filter the data
out <- ...({
# We'll try to make sure this catches any errors
salinity.data <- filter(data.df, salinity == salinity.val)
},
error = function(c) { # The actual error information is passed in as the variable "c"
# Assume the error occurs when no salinity.val is provided
message("Error: potentially missing parameter information")
return(NULL)
}) # End tryCatch
if (...) {
# Run the rest of our code
salinity.data <- filter(data.df, salinity == salinity.val)
# make the plotted data
salinity_plot <- make.facet.plot(salinity.data, salinity.val)
# generate a save file name
save.file = paste(salinity.val, "graph.facet.function.png", sep = "_")
# save the plot
ggsave(plot = salinity_plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(salinity.data, salinity_plot, save.file))
}
else return(out) # if it's an error we'll get the NULL
}
# save a facet plot but look at the output of the plot
save.facet.plot(compounds_data, "brackish")[[2]]
# save a facet plot without any salinity type
save.facet.plot(compounds_data)[[2]]
tryCatch() to set values within your function¶What if, instead of just returning NULL when we produce an error, we change values on the user's behalf and continue? Of course, our example here is in the context of an expected error; we can't always account for the nature of the error(s) we'll encounter. You could make things more complex and program some statements to determine the error type!
In our example, we'll anticipate the issue of a missing salinity value and "assume" that will be our only problem. We'll take advantage of the <<- scoping assignment operator, which searches up the hierarchy of enclosing scopes until it can assign a value to the specified variable. This happens in place of R dynamically creating a local variable.
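A minimal self-contained sketch of the difference between <- and <<-:

```r
counter = 0
increment = function() {
  counter <<- counter + 1   # reassigns 'counter' in the enclosing scope
  # counter <- counter + 1  # would only create and discard a local copy
}
increment()
increment()
counter   # 2
```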
Let's modify save.facet.plot() function so that our error handler can set the salinity.val variable within save.facet.plot().
# Make a function to filter data, make the plot, then save the plot
save.facet.plot = function(data.df, salinity.val) {
# filter the data
out <- tryCatch({
# We'll try to make sure this catches any errors
salinity.data <- filter(data.df, salinity == salinity.val)
},
error = function(c) {
# Assume the error occurs when no salinity.val is provided
message("No salinity value provided")
message("Substituting with 'brackish' value")
# Remember: we are in a mini function at the moment
# We need to go up a level and set salinity.val within the save.facet.plot function
...
# This will allow us to proceed with the rest of the code
})
# Now we no longer need to check the null state of our tryCatch
# Run the rest of our code
salinity.data <- filter(data.df, salinity == salinity.val)
# make the plotted data
salinity_plot <- make.facet.plot(salinity.data, salinity.val)
# generate a save file name
save.file = paste(salinity.val, "graph.facet.function.png", sep = "_")
# save the plot
ggsave(plot = salinity_plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(salinity.data, salinity_plot, save.file))
}
# save a facet plot but look at the output of the plot
save.facet.plot(compounds_data)[[2]]
Here's an alternative version of our code that runs all of the remaining code within the tryCatch() call, using the finally option.
# Make a function to filter data, make the plot, then save the plot
save.facet.plot = function(data.df, salinity.val) {
# filter the data
out <- tryCatch({
# We'll try to make sure this catches any errors
salinity.data <- filter(data.df, salinity == salinity.val)
},
error = function(c) {
# Assume the error occurs when no salinity.val is provided
message("No salinity value provided")
message("Substituting with 'brackish' value")
# Remember: we are in a mini function at the moment
# We need to go up a level and set salinity.val within the save.facet.plot function
salinity.val <<- "brackish"
# This will allow us to proceed with the rest of the code
},
...={
# Run the rest of our code
salinity.data <- filter(data.df, salinity == salinity.val)
# make the plotted data
salinity_plot <- make.facet.plot(salinity.data, salinity.val)
# generate a save file name
save.file = paste(salinity.val, "graph.facet.function.png", sep = "_")
# save the plot
ggsave(plot = salinity_plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(salinity.data, salinity_plot, save.file))
})
}
# Test our function
save.facet.plot(compounds_data)[[2]]
Now that you have the basics, you can continue to build on complexity (or simplicity) as you need it.
While working within the R environment, we've learned to manipulate data and save its output as text or Excel files. We've also learned to write our own functions and save output as variables. When we create very useful functions and want to keep the code, there's no need to copy and paste it into every script we write.
In this last section we will discover how we can import our own functions, save data objects, and load R workspaces into memory.
source()¶As a final extension of our control flow lesson: you already know about packages, which hold functions and data pre-made by others within the R community.
You don't need to build entire packages of your own, but you can certainly make source files to keep functions and pertinent variables you re-use across your analyses.
To load a saved .R file that contains purely code, you can use the source() command. Let's try!
#?source
# Load data and information from another R script
source("...")
ls() to find variables and functions¶After loading your script into memory, you may want to see what is available in your environment. The ls() command lists everything that is available, but it does not discriminate between objects and functions.
# See what variables and functions you have in memory
print(...)
lsf.str()¶As you can see above, ls() returns all of the objects currently saved in memory, including the functions we've previously declared and possibly some new ones imported by our call to source(). To see which functions we have loaded, outside of those from packages, we can use lsf.str(). Let's see what's new and try something out.
# To see which functions are available in memory
...
# Let's try a new function from "Lecture07.R"
...
# Look up newly added variables
codon_translation
# Use codonToAA on a single codon
codonToAA("AUA")
# Use codonToAA on multiple codons
codonToAA(c(...)) %>% str_flatten()
# Let's try surprise.class()
surprise.class("...")
save() objects or your whole kernel memory!¶From time to time you may have results from analyses that aren't easily translated back into data tables or Excel files. Perhaps you want to save objects or plots from a complex analysis for later use. You can accomplish this with the save() command by providing a list of one or more objects to save.
print(ls())
save(brackish.plot, compound_mean, ..., file="./data/Lecture07.Rdata")
save.image() saves your entire workspace¶Sometimes you just want to save everything in memory, perhaps as a safeguard against accidental errors after running long analyses. The same can be said about saving single objects, but you may find this a useful command in the future.
# Save an image of everything
...(file="./data/Lecture07.all.RData")
load() .RData files into memory¶When you're finally ready to revisit your saved objects or workspace, you'll want to restore them. It's as easy as using the command load(). Let's demonstrate, but first we need to clean up our current memory with rm().
# Clear memory
rm(list = ls())
# check that it's clear
print(ls())
# reload it all
...("./data/Lecture07.all.RData")
print(ls())
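A self-contained version of the same save/remove/load round trip, using a temporary file so it won't touch your data/ directory:

```r
tmp = tempfile(fileext = ".RData")
x = 1:5
y = "hello"
save(x, y, file = tmp)   # write both objects to disk
rm(x, y)                 # remove them from memory
load(tmp)                # ...and restore them
x   # 1 2 3 4 5
y   # "hello"
```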
Let's review our time together. Over the span of this course we've discussed
the dplyr package, the tidyverse package, the ggplot2 package, and the stringr package. You now have the tools to accomplish quite a few tasks and the foundation to grow your skills as needed. Let's run a final function together to celebrate!
# Time to run our final function together
...
There is no post-lecture assessment this week. Your DataCamp accounts will continue to remain active for another ~4 months during which time you can choose to explore the site's different courses. Please take advantage of this opportunity to keep growing your R skills!
However, we have created a post-course survey you can fill out anonymously. You can use this survey as an opportunity to tell us about your experience and help shape the future offerings of this series. Please take 5-10 minutes to fill out the survey. We really appreciate your feedback!
Your final project will be due two weeks after this lecture at 13:59 hours on Thursday November 11th. Please submit your final assignment as a single compressed file which will include:
Please refer to the marking rubric found in this course's root directory on JupyterHub for additional instructions.
You can build your Jupyter Notebooks on the UofT JupyterHub and save/download the files to your personal computer for compressing before submitting on Quercus.
Any additional questions can be emailed to me or the TAs or posted to the Discussion section of Quercus. Best of luck!
Revision 1.0.0: materials prepared in R Markdown by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and prepared in Jupyter Notebook by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.
Your DataCamp academic subscription grants you free access to the DataCamp's catalog for 6 months from the beginning of this course. You are free to look for additional tutorials and courses to help grow your skills for your data science journey. Learn more (literally!) at DataCamp.com.